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Abstract 

This paper proposes a fast and accurate method for sparse regression in the presence of missing data. The under¬ 
lying statistical model encapsulates the low-dimensional structure of the incomplete data matrix and the sparsity of 
the regression coefficients, and the proposed algorithm jointly learns the low-dimensional structure of the data and 
a linear regressor with sparse coefficients. The proposed stochastic optimization method, Sparse Linear Regression 
with Missing Data (SLRM), performs an alternating minimization procedure and scales well with the problem size. 
Large deviation inequalities shed light on the impact of the various problem-dependent parameters on the expected 
squared loss of the learned regressor. Extensive simulations on both synthetic and real datasets show that SLRM 
performs better than competing algorithms in a variety of contexts. 


1 Introduction 

Modern statistical data analysis requires tools that can handle complex, large scale datasets. Due to constraints in the 
data collection process, one often has incomplete datasets, i.e., datasets with missing entries, with which we need to 
perform statistical inference. For instance, in sensor networks, readings from all the sensors might not be available at 
all the times because of malfunctions in sensors, or simply because it is too expensive to gather readings from all the 
sensors at all the times. Similarly, when conducting surveys, responders may avoid answering certain questions for the 
sake of privacy or otherwise, leading to missing entries in survey data. Recommender systems, implement algorithms 
that are required to train on data with missing entries. For example, popular recommendation engines such as Netflix, 
online radio services such as Pandora, social networks such as Facebook, Linkedin regularly deal with prediction 
problems involving data with missing entries. An ever increasing demand to gather as much data as possible, clean or 
not, in this big-data era, has led to the need for statistical methods that can deal with not just clean data but also noisy 
data with missing components. 

The focus of this paper is on sparse linear regression when the feature vectors or design matrix have missing 
elements. Matrix completion methods allow missing elements to be imputed accurately, but generally do not account 
for any auxiliary label information. Similarly, sparse linear regression and LASSO methods rely upon a fully-known 
design matrix. One might imagine using matrix completion to impute missing entries and then applying sparse linear 
regression methods to the completed design matrix; we demonstrate that this two-stage approach is sub-optimal, and 
propose a unified regression framework that yields significantly better performance in a variety of tasks. 

* gantimahapat@ wisc.edu 

^rmwillett@ wisc.edu 
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1.1 Contributions. 

Our contributions are as follows 


1. In this paper, we propose a statistical model (Section]^ for the problem of sparse linear regression with missing 
data. Our model captures low-rank structure in the data and sparsity of the regression coefficients in the lower 
dimensional representation of the data. 

2. We provide an optimization-based approach that simultaneously learns the underlying subspace structure and 
the sparse regression coefficients (Section]^. Our optimization algorithm, called SLRM, takes a combination 
of stochastic first order and second order steps, alternating between the different parameters of the proposed 
statistical models. 

3. We establish large deviation bounds (Section]^ for the risk of the regressor learned by our algorithm in terms 
of the empirical loss, the ambient dimension D, and a parameter 7 used by our learning algorithm. Using our 
performance bounds we can understand the impact of the amount of missingness on the training error and the 
test error. 

4. We provide extensive experimental results (Section]^ on synthetic and real datasets, comparing the performance 
of SLRM and a competing algorithm. From our experimental results, we conclude that SLRM has good noise 
tolerance properties, and uses the label information well to learn a good regressor, as measured by its mean 
squared error on a test dataset with missing features. 


2 Problem Formulation: Sparse Regression With Missing Data 

Given D-dimensional labeled data with missing features, we are interested in prediction, particularly regression prob¬ 
lems. Let X = {xi,X 2 , ■ ■ ■ ,Xn) & be a data matrix, where the columns have been sampled i.i.d. from a 

distribution. Since we are interested in regression problems with missing data, we do not get to see all the entries of 
the data matrix X. To formalize this notion, let Ui,..., be subsets of {1,2,..., D}. Given an index set U, let 
Pn{x) denote a sub-vector of x consisting of elements whose indices are elements of the set. We observe a dataset 
(Phi (a^i), 2 / 1 , Ui),..., {Pq^ (Xn), 2/n, of size n, i.e., we observe only a few entries of the data points xi,..., Xn, 
where the entries are indexed by the sets Ui,..., f2„ respectively. We call the vector Y = (yi, 1 / 2 , ■ • ■, Vn)^ the 
label vector. Given this training data, we are required to learn a regressor, which when given an unseen test point 
{Pa{x), U), predicts a label y that is close to the true label of x. In order to solve this problem, we consider the 
following statistical model: 


where w* is a sparse vector in I 
is a matrix in We call a,! 


[/* in: 


)xd 


X — cx 

Y = AJw* -I- er, 

{d < D) is a matrix with full column rank, and = [ai*,. 


, a. 


( 1 ) 

( 2 ) 

n!i<] 


the code of Xi w.r.t. the matrix U. The vector ey = (e 


Vl 1 • 


^ is random noise 
..., CxA is a noise 


that is independent of other problem parameters such as C/, A, w, Ui,..., Similarly ex = [^xi, 
matrix with i.i.d. entries, sampled independently of other problem parameters. 

Our statistical model given in Equations |1|2| is motivated by the fact, for many data matrices of interest, even 
though the ambient data dimensionality is large, the data lies close to a lower dimensional subspace of dimensionality 
d. Given, this d-dimensional representation of the data, we are interested in learning a linear regressor with sparse 
coefficients that predicts the labels well. 

To the best of our knowledge, for the problem of regression with missing data, our work is the first work that 
simultaneously exploits both a low-rank structure of the incomplete data matrix and the sparsity of regressor. The 
assumption of a parametric model for our regression problem allows us to go beyond the transductive setting which 
was inherent in the approach of [Goldberg et aL (20101 (as detailed in Section]^. While we consider d < D, we 
are also interested in cases where d is of the same order as D and the regressor is sparse in the lower-dimensional 
representation of the data. This model is relevant to many applications, as described in Section]^ 
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For instance, in a sensor network D sensors listen to d sources. As one would expect, this sensor data is far from 
being “clean”: it is usually noisy, and has missing entries. A common approach in analyzing such sensor data is to 
perform a subspace analysis of the sensor data (Tuncer and Friedlander[ 2009 [ Krim et oTI 1995[ Roy and Kailath 


19891 and find the best fit d-dimensional subspace of the data. For modern sensor networks, both D and d are large; 


that is, a large number of heterogeneous sensors listen to a large number of sources. Exploiting the underlying d- 
dimensional structure during regression yields increased robustness to noise and missing data. 

Notation. Like in the definition of A*, A = [ai, 02 , • ■ •, ctn], A = [di, a 2 , ■ ■ ■, ctn]- Given a matrix M, denote 
Pq{M) as the matrix whose rows are those rows of M whose indices are elements of the set ft. For example, if 
n = {1,3,4}, then Pq{M) has rows 1,3,4 of matrix M. At times, for ease of notation we may write xq^Mq to 
denote Pq{x), Pq{M) respectively. By we represent an identity matrix with d rows. 

3 Related Work 


Our statistical model bears resemblance to the statistical model used in partial least squares (PLS) ( |Hastie etal. \ 2003| l. 
However, unlike PLS we enforce additional sparsity assumptions and can handle missing data. Dictionary learning was 
introduced for unsupervised data analysis for better data representation ([Maurer and Pontil) 2010| [Vainsencher et ah 


[201 Ij). The idea is to learn a dictionary so that each data point could be represented well as a sparse linear combination 


of the columns of the dictionary. Dictionary learning has also been extended to prediction problems (Mairal et al. 


2012t Szlam and Sapiro 2009|l, where the problem is to learn a dictionary for the prediction problem at hand. The 


problem that we tackle in this paper can be seen as learning a dictionary for prediction problems in the presence 
of missing data. Sufficient dimensionality reduction (SDR) ( Suzuki and Sugiyamaj 2013[ Lukumizu et al. 2009| l, 
is a form of supervised dimensionality reduction, where the problem is to find a central subspace Z such that the 
prediction task is independent of the unlabeled data given the projection UzX of unlabeled data onto the central 
subspace. SDR focuses on achieving conditional independence between Y and X given AzX without making any 
assumptions on the functional dependency of the prediction task on the central subspace. SDR does not fully exploit 
linear relationships between labels and features that arise in many practical settings, and the problem of SDR with 
missing data has not been investigated. Loh and Wainwright (20111 investigate non-convex algorithms based on 


maximum likelihood estimation for the problem of high-dimensional regression with missing data. However, they 
work with a different statistical model which does not capture the low-rank structure of the data and assumes that the 
regressor is sparse in the ambient space. In contrast, our statistical model explicitly assumes that the missing data 
matrix has a low-rank structure and exploits this low-rank structure in data to learn a regressor with sparse coefficients 
in the low-dimensional representation of the data. Another closely related work is that of (Goldberg et al. 2010|l, 


where the authors consider the problem of multi-task regression with missing data features and missing labels. The 
authors pose this problem as a matrix completion problem of the matrix formed by the concatenation of the data and 
label matrices. However, the authors deal with the transductive setting only - their approach does not allow one to 
predict a label for a new test datapoint. In contrast, this paper exploits an alternative statistical model for how the labels 
are generated that allows prediction on new test datapoints. Linally, Principal Component Regression (PCR) ( jHastie 
\et al.\ [2003 | l is a dimensionality-reduction based procedure for regression without missing data. PCR first performs 
PCA on the unlabeled dataset, followed by least squares regression in the PCA space. This two-step approach does 
not exploit label information when estimating the underlying low-dimensional model; the limitations of this choice 
are detailed in SectionH) 

As mentioned above, we use stochastic optimization methods that can operate on streaming data to ensure scalable 
algorithms. Thus the low-rank structure in our problem is estimated using techniques drawn from the subspace tracking 
literature. Oja’s method (Oja 1982j l, PAST ( jYangj 1995| l and variations such as OPAST (Abed-Meraim et al. 2000| 
perform subspace tracking when there is no missing data. More recent developments, such as GROUSE (Balzanoef a/.j 
20T0b} and PETRELS ([^ et al.\ |2012|) can handle missing data quickly and accurately. However, these algorithms 


are inherently unsupervised and hence do not directly address the supervised regression problem considered in this 
paper. 
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4 An Optimization Approach And A Learning Algorithm 


Before we describe our optimization based approach to the problem considered in this paper, we discuss a multi-step 
approach (essentially an extension of PCR to missing data problems) that exploits both the low rank of the incomplete 
data matrix and the sparsity of the regression coefficients: 


1. Solve the following optimization problem 




J2\\PnAxi) - PnAUc 


li 


(3) 


The above problem aims to consider a decomposition of the incomplete data matrix X as the product of two 
matrices U, A, such that the Frobenius norm of the difference between X and UA over the observed entries is 
minimized. This problem has been studied in the matrix completion literature ( Koren et al.\ 2009[ pmn et aL\ 
2013| l, and in the subspace identification and tracking literature (Chi et al. 2012 |Hua et at. y |1999| l. A standard 


approach to solving this problem is via alternating minimization, where we alternate between optimization w.r.t. 
U and the vectors ai,..., In the special case that for alH = 1,..., n, = {1, 2,, D} (as in classical 
PCR), the solution to the above problem is obtained by performing PCA of the data matrix. 

2. Let U, A = [ai,..., Q!„] be the solution of 0. Learn a linear regressor with sparse coefficient, using A as the 
design matrix and by solving the following £i penalized problem 


w = arg min — \\Y — A 

w n 




A||ti;| 


(4) 


We call the above two step procedure MPCRQ Note that in Step 1 of MPCR, the label data is not used. A merit 
of MPCR over other approaches previously proposed for our problem is that MPCR explicitly utilizes the low-rank 
structure of the data, and a linear model for the regression task at hand. 

However, such multi-step algorithms that do not utilize the label information in all the steps are inherently label- 
inefficient. First, such multi-step algorithms fail to exploit information about [/* reflected by the labels. Second, the 
estimate U is one basis (of many potential bases) of the underlying subspace. Since we perform sparse regression on 
the subspace coefficients, the choice of basis matters. However, without label information, we have no way of knowing 
which basis rotation is best. Third, MPCR solves a harder problem than necessary. To see why, note that when is 
sparse, then for the purpose of prediction, only those rows of A* and columns of (7* that correspond to the non-zero 
coordinates of w* matter. 

In general, any multi-step procedure that does not utilize label information when estimating the underlying sub¬ 
space will be label-inefficient for learning a good predictor. This is particularly true when both D and d are of the same 
order, as in the sensor network problems described in Section]^ This is because when d is comparable to D, there is 
a good deal of information in the labels that can be used to efficiently estimate the underlying subspace. We observe 
this in our experiments too, where on the CT slice dataset, where D = 384, d = 181, MPCR gives substantially worse 
performance than our proposed algorithm. 

Armed with these insights, we are interested in procedures that utilize label information fully. We do this by 
proposing a joint optimization procedure that simultaneously learns all the relevant variables in our model. 


4.1 Learning Via Joint Optimization 

Given constants Ai, A 2 , A 3 > 0, we propose to solve the following optimization problem. 

A "■ 1 

minimize — \\Pni{xi) - PQi{Uai)\\l+ -\\Y - A^w\\l + A 2 ||ti;||i + A 3 ||w ||2 
u,A,w n ^' n 

i=l 

subject to U = Id, 

’ M in MPCR stands for missing 
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where U G 


pDxd 


, A — [ai,..., a„] G 


,w G 


Like in MPCR, the first term corresponds to a matrix 


completion term. The second term in the above optimization formulation measures the squared loss of a regressor 
w on a low-dimensional representation of the training data. The third term in our optimization formulation is the 
norm penalty which encourages sparse w. Finally the last term is motivated by elastic net type formulation for sparse 
prediction. We optimize over 17 , w, A, under the constraints that the columns of U be orthonormal to each other to 
ensure uniqueness of the solution. The above optimization procedure outputs w, U, A. Given an unlabeled data point 
with missing entries, {Pn{x), 17), and a constant 7 G [0,1), we first project the point onto the subspace spanned by 
the columns of the matrix I/q, to obtain x = (C/q l7a)~^l7Q xq. Our regressor, 0 ^ then predicts the label of 
{Pn{x),n)as 

( 6 ) 




t(l-T) i 


where Ipj is the indicator function. Whenever | \{Uq Ua)~^\\ is large, it implies that the missing entries will not allow 
accurate subspace projection; in this case, our method outputs 0. A good choice of 7 depends on d and |17|, and we 
shall discuss this in detail in Section|5] 


4.2 Solving The Optimization Problem 

The optimization problem shown in (|^ is individually convex in the optimization variables U, A, w, but jointly non- 
convex. We solve this problem via an alternating minimization approach, where we minimize over 17, A, w alterna¬ 
tively. In addition, we adopt a stochastic optimization approach. Our algorithm is called Sparse Linear Regression with 
Missing data (SLRM). SLRM makes a pass over the dataset, and each time uses a single data point {PQ_i.{xt),yt, ^t) to 
make updates to all the parameters. SLRM uses stochastic second order steps to update matrices U, A, and stochastic 
first order steps to update w vector. Algorithm |4.3| provides a pseudocode of our proposed stochastic optimization 
algorithm. There are six main steps, which we shall discuss below in detail. 

Initialization. In Step 1 we initialize [7, w to Uq, wq. Uq is obtained by performing SVD of the incomplete data 
matrix with O’s filled in the missing entries. The left singular vectors corresponding to the top d sin gular values form 


the f7n matrix. Similar initializa tion techniques have been proposed in matrix completion literature (Jain et al. 


Hardt 

2013 

Koren et al. 

2009 


2013 


,i?^ to 


initialize wq by solving the LASSO regression on Y and Aq, similar to We initialize matrices, i?)*, i? 2 , ■ 
a multiple of the identity matrix. These matrices are required in Step 6 of our algorithm. 

Updating A. In round t, SLRM uses (Pq^ (xj), yt,^t) to update our estimate of the A matrix. Since (Pq^ (xt), yt,^t) 
is only responsible for the 7* column of matrix A, in Step 5 of SLRM we replace the 7* column of At-i, with at, 
to obtain At- This update reduces to a simple unconstrained quadratic optimization problem over at, which can be 
solved in closed form by solving a system of linear equations. 


Updating U. In Step 6 , we update Ut-i by using the MODIFIED-PETRELS (MP) routine. The MP routine is 
inspired by the PETRELS algorithm ( Chi et a77||2012| l, which was designed for estimating subspaces from streaming 

data with missing entries. PETRELS can be seen as solving the optimization problem argmin^"^^^ fi{U), where 

u 

fi{U) = \\Pfi.{xi — Pai)|p, and a’s correspond to a projection of the observations onto the current subspace esti¬ 
mate. MP solves the same optimization problem, but with the at from Step 5, which uses label information. Both 
methods update JJt-i to Ut by performing a single stochastic Newton step on ft{U), starting at Ut-i, and using 
Pfit (xt), Vlt,at. This Newton step can be implemented efficiently using recursive least squares, and a pseudocode for 
the MP routine is available in Algorithm 2. 

Orthonormalization of updated U. Since we are optimizing over the manifold of rectangular matrices with 
orthonormal columns, we perform an orthonormalization step in Step 9, by solving the following nearest orthogonal 

matrix problem: Ut = arg min 11C7 — C/t 11 subject to U = Id- This problem has the closed form solution as shown 

u 

in Step 7 of SLRM. Note that by construction, our orthonormalization step always guarantees, that the columns of Ut 
always span a d-dimensional subspace of 


Updating w. In Step 8 , we perform one step of the stochastic projected gradient algorithm w.r.t. w. Our objec- 
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tive function is — 11 y — AJ w \\2 + AaHwHn +A 2 ||u'||i . A step of the stochastic projected gradient method requires us 
n 

" -V-^ 

F{w) 

to calculate a noisy estimate of, WF{wt-i), using at, yt, followed by an application of the prox operator correspond¬ 
ing to lliulli. 

Validation Steps. We let w and U denote the estimate stored at the end of the previous round. In Steps 9-12, we 
determine, using a hold-out validation set, whether w and IJ form a better regressor than wt and Ut- The pair that 
achieves smaller hold-out error is then stored as w and U for the next round. 

These steps are required since we are solving a non-convex optimization problem, and hence it is not necessarily 
true that {wt, Ut) leads to the best regressor. 

Note that SLRM can easily be modified to handle the case where we have semi-supervised data. If we get unlabeled 
data in a round t, then we perform the optimization problem in Step 5 of the SLRM algorithm without the term 
{yt — wj_ia)'^, and simply skip the weight update in Step 8 . 


4.3 Computational Complexity and Convergence 

Step 5 of SLRM solves a system of linear equations and takes 0{\ilt\(P) time. Step 6 of SLRM allows a parallel 
implementation, where the rows of the matrix Ut are updated in parallel. This takes 0{\flt\cP) time. Finally, Step 
7 of SLRM is the classical orthogonal Procrustes problem and takes 0{D(P + (P) time. Hence, all together the 
time complexity of our algorithm is 0{\nt\(P + D<P). In particular, since steps 5,7 are well studied numerical 
problems, they have efficient numerical implementations available. Since, our algorithm is built on exploiting the 
low rank structure of the missing data matrix, we attempt to get a rough estimate of the subspace spanned by the 
data. Algorithms that attempt to estimate the subspace spanned by missing data such as GROUSE ([Balzano et al. 


2010b I, PAST (Yang 1995 i need to expend 0{\nt\cP) computation. Hence, it appears at least 0(|Ht|d^) amount of 


computation is inevitable. The overall higher computational complexity of our SLRM over algorithms such as PAST, 
GROUSE etc. is because of the additional prediction task that we aim to solve with SLRM. 

SLRM, like other task driven dictionary learning approaches ( Mairalef a7||201^ , uses a combination of stochastic 
updates and alternating minimization for a non-convex objective function. Empirically we observe similar convergence 


behavior to that reported in (Mairal et al. 2012 1 ; to the best of our knowledge, no formal convergence guarantees 


are available for SGD based approaches to the task driven dictionary learning problem ( Mairal et a/.||201^ . Note 
that Mairal et al. ( 2009 | l also uses stochastic updates and alternating minimization for a biconvex objective and de¬ 
scribes associated convergence analysis; however, the problem considered in that paper is far simpler than ours, in that 
it was not task-driven and didn’t handle missing data. 


5 Generalization Error Bounds 

Definition 1. Given a 7 G [0,1), /et 


and 


U'-i = {fw,u\'w G U G lltulli < Ri,U^U = Id, fw,u A os given in Equation^. 

Our main theorem is as follows 

Theorem 1. Consider a regression problem where a training set of n data samples {PQ.{xi),yi,Ili) are sampled 
i.i.d. from a probability distribution, with \yi\ < By, Ha^iHoo < Bx, almost surely. Leteachlli be a set of cardinality 
m, chosen uniformly at random with replacement from the set {1,2,..., D}. Let b = 2{By + . Choose a 
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Algorithm 1 SLRM. Input: Parameters Ai, A 2 , A 3 , d > 0, 0 < 7 < 1, Output: w, U 

1 : Initialize w = Wo,U = Uq,A = Aq, (^ 2 )^ = • ■ •, = ^^d- 

2: Initialize curr-best_val-err = 00 . 

3: for t = 1, 2,. .. n do 
4: Receive {Pn^{xt),yt,^t) 

5: Replace t* column of At-i with ctt to get At- at is given by, 

at = argminAi||Paj(a;t) - PnAUt-ia)\\l + {yt - wj_^af 

a. 

6: Update U, using the current sample {PQ_^{xt),^t), as follows 

Ut,RlRl...,R^D^ Modified-PETRELS(Ut_i,Pn,(a:t),Ui,dt, ..., 

7: Orthonormalize by, Ut 3— Ut{UjlJt)~^^'^ 

8 : Perform stochastic proximal gradient type update using the following equations 

wt = prox^^^^ l|.|l^ wt-i - yti^i^atajwt-i - ytat) + AaWt-i^ 

9: vaLerr = Validation-Error(rt;t, Ut,A) 

10 : if val-err < curr-best-val-err then 

11 : curr_best_vaLerr = vaLerr. 

12: W i — tCi, U i — Ut 

13: end if 

14: end for 
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7 e [0,1). Let L{f) = i J2i=iiyi - f{Po.i{xi),Q.i)Y,L[f) = - f{.PQ.{x),Ll)Y Then for any i5 > 0, and 

a universal constant K > 0, we have with probability at least 1 — <5, over a random sample of size n, for all f € IF-y, 


L{f) < L{f)+K 


ml 


m 

n 


1 A DRiBx ^ / 51og(l/(5) \ ^ b\og{l/S) / DRiBx 

^/n J 1 —7 \ n J n \n ^/n) \ 1 — 7 

( 8 ) 


For appropriate values of Ai, A 2 , A 3 , i?i, the output of SLRM ^ ^ S The complete proof is in the appendix. 
Here, we shall provide a brief synopsis of the proof. 


Algorithm2 MODIFIED-PETRELS. Input: Ut-i,Pn,{xt),Llt,at,R\~^ ,... . Output: Ut,R\, ... ,R^jj 

1 : for j = 1,... ,£) do 

2 : P] = l + aJ{R]mat 

3: v) = (i?f 

4 : p) = l\j&Ht] 

5 : - p]{p])m^v]y 

6 : Ut,j = Ut-I,j +pj{xt,j - aj(jt-i^j){Rj)'^at 

7: end for 


Proof Sketch 1. Our proof uses standard large deviation results connecting L(f) and L{f), similar to (Srebro et al. 


2010 Thm. ])). Wfe upper bound the Rademacher complexity of the function class Lemmas 2 and 3 in the appendix 


show how to perform these calculations. 


We would like to remark that in Theorem we assumed that |Oi| = m for all i. This assumption is only a 
technical convenience and allows us to state our result in the cleanest possible way. In general, we wish to choose 7 
so that both (a) the empirical error L{f^ ^ is small and (b) the R.H.S. of the inequality in Theorem 1 (which scales 
like (1 — 7 )“^) is small. A similar trade-off can also be found in structural risk minimization, commonly studied in 
classical supervised learning, where we know that functions belonging to a richer class have smaller training error, but 
potentially larger upper bounds on their generalization error. 

Specifically, the R.H.S. of the inequality in Theorem 1 depends on a term of the form y. This term can 

be roughly thought of as a measure of the complexity of me function class, R.y. A large 7 would imply that we are 
learning from a richer class of functions and, as can be seen from Theorem[2 the upper bound on the risk of functions 
in the class will be potentially larger. Thus we wish to keep 7 as small as possible. 

On the surface, it may appear that 7 must be close to one to yield a small empirical error (by not predicting zero 
values). However, in many settings, it is possible to choose a small value of 7 and still have a low empirical error. 
To see this, note that larger m = |Hi| leads to easier learning problems, and hence smaller error rates. Let S be the 
subspace spanned by the columns of U. Let p{S) = ^ maxj 11 ^ 56 ^ 1 ^, where Pg is the projection operator onto S, 


and e, is the standard basis element in D dimensions. p{S) is known as coherence of subspace S (Candes and Recht 


20091. It is well known (Balzano et al. 


2010 aI that ||(t/f{(7n) 


- 11 


< 


, with probability at least 1 — 5 over 


the random choice of Q, where 73 = y '■ Hence, it is enough to set 7 > y From our 

previous discussions, we know that a small 7 would mean that the complexity of J 7 , is also small. To see how this 
affects L{f^ ^ ^), notice that because with probability at least 1 — 5, 11 (17^ 11 < we can claim that on 

an expectation we are guaranteed to make a non-zero prediction on less than a fraction 6 of our training examples. 
Hence by choosing a large m, we are guaranteed that we can work on a sufficiently small function class J 7 ,, and yet 
not incur a large training error. This implies from Theoremj^ that L{f^ q is small. Hence, the correct choice of 7 
depends on m, and a suitable choice of m depends on d. We now have a nice interplay between the number of random 
measurements, m, the prediction error of the final regressor, the training error of the regressor, the ambient dimension 
D, and the intrinsic dimension d. 
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(c) Non-zero cr^, cr^ (d) Impact of p, probability 

of feature being observed (not 
missing) 

Figure 1; Comparison between SLRM (solid, red line) and SMPCR (broken, blue line) on synthetic datasets for 
D — 100 and d — 30. 


(a) Noiseless 


(b) Non-zero crj; cr^ = 0 


6 Experimental Results 


Experimental Setup. We generated datasets of increasing size, with D = 100 and d = 30. These datasets were 
generated by hrst generating a common U matrix of size D x d with random, orthonormal columns. For a given 
dataset size, five different A matrices of size d x n were generated by sampling each entry from a standard normal 
distribution. Separate validation and test datasets were also generated by generating additional random A matrices, in 
the same way as described before. We generate random ex with each entry of matrix ex (resp. ey) having mean zero 
and variance (resp. ay). In order to simulate missing data, we retain each element of each observed feature vector 
with probability p, and in the test and validation datasets with probability q. While in the theoretical results, for ease of 
analysis we assumed that the set fl, is chosen uniformly at random, with replacement, from the set {1, 2,..., D}, for 
our practical implementations, we choose 17 of size m by choosing each feature with probability p. Hence, Em « pD. 
Previous analyses have shown that these two sampling strategies behave similarly ( |Recht| |201 l| l. All the results 
reported here are averaged over the hve different random datasets that we generated. Similarly, w vector used in our 
model was generated at random from a Gaussian distribution, and random coordinates of w were set to 0. The sparsity 
level of w was set to 10. 7 is set to 0.001. Ai, A 2 , A 3 are chosen by using a held-out validation set, and searching for 
parameter values which give the smallest MSE. We found that the performance of SLRM is not very sensitive to the 
values of A’s, and hence a coarse range is enough during validation, d is set by performing PCA on a subset of the 
data, and calculating how many dimensions are required to capture about 99% of the variance 

We compared our algorithm with a stochastic version of MPCR (PCR modified to handle missing data, detailed in 
Section]^, which uses the PETRELS algorithm to perform Step 1 of MPCR, and then follows it by a stochastic pro¬ 
jected gradient method to solve the LASSO problem in Step 2 of MPCR. We shall call this stochastic implementation 
SMPCR. 


Eor both SLRM and SMPCR, we allow multiple passes over the dataset in our experiments. The maximum number 
of passes is fixed to 500 for both SLRM and SMPCR. % used in Step 8 of SLRM is chosen to be a constant, p, for 


a fixed number of rounds, and then allowed to decay as pjt. This strategy has also been used advocated in (Murata 


1998[ Mairal et al. 2012|l, and we use this method in our algorithm. The value of p was found by trying a range of p. 


and choosing the one that gave the best error rate over the hold-out dataset. 

Experiments in the noiseless setting. In our hrst set of experiments, we set Ux = ay = 0. Ligure 1(a) shows 
the error bars for the mean squared error (MSE) on the test dataset for both SLRM (in solid, red line), and SMPCR 
(in broken, blue line). As we can see from the hgure, the performance of both SMPCR and SLRM improves with 
increasing n. Ligure [T(a)] also indicates that the average MSE of SLRM is lower than that of SMPCR for all n. 

Impact of non-zero a^. While the above experiments, demonstrate the superior performance of SLRM over 
SMPCR in the noiseless setting, it does not tell us how these algorithms perform in the presence of noise. We shall 
now study the impact of non-zero a^ on the performance of both SLRM and SMPCR. Lor a clearer understanding, we 
set ay = 0, and hx the size of the dataset to n = 12800. Ligure 1(b) shows the impact of increasing cr^ on the MSE 
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(a) Real datasets (b) ATPld (c) ATP7d 

Figure 2; Comparison between SLRM (solid, red boxes) and SMPCR (broken, blue boxes). 


of SLRM, and SMPCR. As we can see from this figure, the MSB of both SLRM and SMPCR gradually increases 
with increasing cr^. Like in the noiseless setting, the MSB of SLRM is always substantially smaller than the SMPCR 
method. 


Impact of non-zero a. 


^ ay. We shall now examine the effect of non-zero values for cr 




on the MSB of SLRM 


and SMPCR. In these experiments, we fixed the size of the dataset to n = 12800, and increased a. 


x'l 


Bor the sake 


the MSB of both SLRM and SMPCR increases with 


of simplicity, we keep = cr^. As we can see from Bigure 1(c) 
increasing ax, ay. In this case too, the MSB of SLRM is substantially smaller than that of SMPCR. Prom our plots, 
SLRM seems to be more noise tolerant than SMPCR. 

Impact of increasing p. In this experiment, we examine the impact of increasing p on the error rate of the proposed 
learning algorithms. We hx a'^ = a"^ = 1, and the size of dataset to n = 12800. Like in previous experiments q is set 
to 0.75. As expected, the error rate of both SLRM, and SMPCR goes down as p increases. 


6.1 Experimental Results on Real datasets 

We performed experimental comparisons on 15 real world tasks. While an extended discussion of our datasets, has 
been relegated to the appendix, in this paper, we shall provide a brief description of the datasets. Our datasets are 
Leukemia, CT Slice, ATPld, and ATP7d. Both ATPld, and ATP7d ( [Groves and Gini|[2011| | have six tasks each related 
to airline ticket price prediction. The CT-slice dataset consists of 384 features obtained from CT scan images and the 
task is to estimate the relative location of the CT slice on the axial axis of human body. The Leukemia dataset ( jGolub j 
et a/.||T99^, and the colon-cancer dataset are high dimensional datasets with D = 7129 and D = 2000 respectively. 


Both these datasets have binary labels, { — 1, -1-1} as the target but for this paper we treat it as a regression problem. 


Prom Pigures 2(a) 2(c) it is clear that the median MSE of SLRM is always lesser than the median MSE of 
SMPCR on all the tasks. When labels are quantized, such as in classification problems, or noisy, the advantage 
gained from utilizing label information is limited. This explains why on the leukemia and colon cancer datasets, 
SLRM might, at times perform worse than SMPCR. As mentioned in Section]^ the superior performance of SLRM 
over SMPCR, on the CT slice dataset can be explained by the fact that for CT slice dataset, both D = 384 and <7=181 
are large. On both ATPld and ATP7d datasets, on almost all of the tasks, SLRM far outperforms SMPCR. 


7 Conclusions and Future Work 

This paper studies the problem of regression with missing data. We proposed a new statistical model and an opti¬ 
mization based approach for learning the parameters of the model. We established risk bounds for our regressor, and 
demonstrate superior empirical performance over competing algorithms. This work can be extended in several ways. 
Instead of a single subspace assumption, it should be possible to extend our framework to handle the case when data 

^The horizontal bar in the baiplots show the median MSE of the method 
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is generated from a union of K subspaces, using ideas in (Xie et al. 2012| l. Our framework can be extended to handle 
other tasks using different loss functions and to multi-task learning problems via the use of appropriate matrix norms. 


A Towards Proof of Theorem [I] 


In this appendix we shall prove the result of Theorem [T] In order to establish Theorem [T] we need the following few 
important definitions and results which have been taken from Srebro et al. ( 2010|l 


Definition 2. Let ai:„ be a collection of Rademacher random variables. The worst case empirical Rademacher 
complexity of a function class T is defined as 


T^n{^) = sup sup -| h{zi)ai 

^i:r. f&J^n 


(9) 


Empiricial Rademacher complexity is defined as 


( 10 ) 


Lemma 1. Let I be an H smooth non-negative loss, such that Vj/i, 1/2,2/3) \l{yi,y 2 ) — ((2/3) 2/2)! < b. Let L{f) = 
¥,z^yl{f{z), y), and L{f) = ^ yi)- Then for any S > 0, we have, with probability at least 1 — h, over a 

random sample of size n,for all f £ T, 


L{f)<L[f) + K 


ffif) log“nK„(^) + 


Hlog^nTZliT) 


b\og{l/S) 


( 11 ) 


The above lemma was proved in |Srebro et al ( |2010 1 , and was used for prediction problems without missing data. 
We shall use the above result for our problem by using z = (x, Llj;). This will enable us to provide generalization 
bounds for our regression problem with missing data. 

Lemma 2. \Srebro et al.|(|2Q7^ For any function class T, containing functions f : Z ^ M., we have that 


= inf 

q :>0 


4a -f 10 


logAf2(e,J',Zi:r,) 


de 


Lemma 3. 

/50m 14 \ SDRiBx 

Proof. For the sake of convenience, let B = . Also throught this document, for the sake of conciseness, we 

shall use the pair (x, (1) to mean (Po(x), fl). Similarily (xi, Eli) would mean (Pq. (xi), Eli) Instead of bounding the 
Rademacher complexity of P, we shall work with a slightly different function class Pg = {/(x, 17) = /3^xo : \ \fi \\2 < 
P}. It is now, enough to control the Rademacher complexity of the function class Fg, since all functions / € 
can be written as /(x, 17) = /r^xn, for an appropriate p where, by definition of F-j, and using Cauchy-Schwartz 
inequality, we can guarantee that ||/r ||2 is upper bounded by B, and hence TZn{Fj) < TZn{Fg). The rest of the proof 
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upper bounds TZn (J^g). We control this Rademacher complexity via lemma|^ Let /i, /2 G J^g- 


d2{fl, f2,Xl-.n,^l:n) — 


1 " 

. - f2{xi,n,)y 


1 ” 

A Xi^n^ - ^2 


< 


n 

\ -E 11/^1 

\ i=i 




( 12 ) 


(13) 


(14) 


(15) 


where, in order to obtain Equation[l^from 14 we used the fact that | \xi^a,^ I li ^ 'J ^11^11^ To upper bound the above 
R.H.S. by e, we need ||/3i — /32|| < —. Hence, to cover Jg, it is enough to cover an ^2 ball of radius B, with 

f 2 ball of radius — ■ Now, we know that the cover a ball of radius R, with balls of radius e, in d dimensions we 

V™||X||oe 

need Using this result, we can conclude that Af 2 {Bg, zi:„) < ^ 

entropy integral, and using lemma we get 


3 By/m\\X\\i 


) m 

. Plugging, this into the Dudley 


'^n{d-g) < min4a + 10 


r™P/e^g 


'JHP) 


Q !>0 


Im ^^^f iDRi\\X\\, 
n ° \ e 


de 


(16) 


For the sake of simplicity, let us denote by U = sup^gjr^ ■\JE{f'^), and by C = L L easy to see that 

F < < C. With this notation, the above inequality can be manipulated as follows 


— m(l—7) 


10 

< 4a + ^ / ■y/m(log(C0--^og(e)) de 

VnJa 


< 4a — 


2QCy/Fi /■\/i°g(G)-iog(F) 
Vd J ^log(C)-log(a) 


02 exp(-02) de 


(17) 

(18) 


where the above expression is obtained by the change of variable, 9^ = log((7) — log(e). Substituting K 2 = 
•y/log(C') — log(F), for the upper limits of the integral appearing above, we get 


T^n{d^g) < 4a - 


< 4a — 


r-fs '2 


2QCy/r 

Vd Jy/F^) 


02 exp(-02) d0 


(19) 


lOa-y^m / f C 


a 


log ( - 1 +567,/ —erf I Jlog ( - ) I + 10 CJ-K 2 e-^- - 567,/— erf/i^a) 


<4a + 1067\/—+ 567^/ —erf I ^/log( - 


(i) 


( 20 ) 


where last equation was obtained by using the inequality xe ^ and by dropping all the negative terms. We 


can now optimize over a, by setting the gradient to 0, to get 


a* = C exp 


/ 4n + 25m + y/lQn? + 200mn \ 


50m 


( 21 ) 
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Substituting a* for a in Equationand over-estimating erf(-) by 1, we get 


'Rn[Tg) < IOC 


2 en 


5C 


mTT 4C _ 0.16Ti 
-I- 

n Je 


Replacing C, by its definition, we get 

10 


T^niJFg) < 




-b -b4e 
en x/n 


( 22 ) 


(23) 


1-7 (l-7)Vn 

Since the above quantity is independent of the sample, hence the above bound on TZ„{Tg) also holds for TZ{Fg), i.e. 




( — 
V \/2en 


o.i 6 ti \ 3Zliii||2f||oo f 50m 14 \ ODRiBx 


4e 


1-7 


< 


\ n y/nJ (I- 7 ) ■ 


(24) 


□ 


Proof of Theorem [J The theorem now follows immediately from Lemma the squared loss, which is by 
dehnition 2-smooth and Lemmaj^ The role of z is played by the random pair (x, 11). The following trivial bound for 
b, required in Theorem [T] holds 


b < \l{yi,y2) - Kyi,y3)\ < |((i/i,2/2)| + \Kyi,y3)\ < 2 Sy -b 


DRi 

r(l- 7 ) 


(25) 


B Description of datasets 

In this appendix we provide information regarding the datasets that were used in our experiments in Section We 
shall provide a description of the datasets that were used in Section 6 . 

1. Leukamia dataset. The Leukamia dataset was obtained from http : / /www .csie.ntu.edu.tw/~cjlin/ 
libsvmtools/datasets/ It is a cancer classihcation dataset [Golub et al] ( |1999| l where the features are 
gene expression levels, and our task is to classify whether the gene expression levels indicate acute myeloid 
leukemia or acute lymphoblastic leukemia. This dataset comes with separate train and test datasets of a total of 
72 points with D = 7129. For our experiments we merge both the train and test datasets, and randomly sample 
30 data points for training, 22 data points for testing, and the remaining for validation. This was repeated hve 
times to obtain five different training, test and validation datasets. On each of the training sets, we retained a 
feature with probability 0 . 1 , to simulate the missing data scenario. 

2. CT slice. The CT slice dataset was obtained from https ://archive . ics . uci . edu/ml/datasets/ 
Relativetlocation+of+CT+slices+on+axial+axis The dataset consists of 384 features ob¬ 
tained from CT scan images. The task is to estimate the relative location of CT slice on the axial axis of 
human body. The original dataset consisted of 53500 data points. For our experiments we sampled 300 data 
points uniformly at random, of which 100 each were used for training, testing, and validation. This process was 
repeated five times, to obtain hve different datasets. 

3. ATPld, ATP7d. The ATPld, ATP7d datasets were obtained from http : / /mu Ian . source forge . net/ 
datasets-mtr . html The ATPld and ATP7d datasets are airline ticket price prediction datasets, where the 
problem is to predict the prices for six target Right preferences, namely the price of any non-stop Right, Delta 
airlines. Continental airlines, Airtran, and United airlines. While for the ATPld dataset these target prices are the 
next day price, for ATP7d the targets are the minimum price observed over the next 7 days. The input features 
for each sample are values that are useful for prediction of the airline ticket prices for a specific observation date- 
departure date pair. The features include quantities like day-of-the-week of the observation date, number of days 
between observation date and departure, and several other price related features such as minimum quoted price, 
mean quoted price etc...In order to normalize our features, we divided all our price related features by 1000, 
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and the feature which measures the number of days between observation date and the day of departure by 180. 
See Spyromitros-Xioufis et al. ( 2012| l; Groves and Gini ( 201 l| l for more details on these datasets. We converted 
the cardinal day-of-the-week feature into a 7 dimensional boolean vector. The sizes of our training, testing, 
and validation datasets are 200,46,50 respectively. These were obtained via random sampling from the original 
dataset. 
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